Refactor internally to RawBatch and CleanBatch wrapper types
#57
I was adding tests for `parse_stac_ndjson_to_arrow` when I discovered that it was inadvertently broken in #53 due to a refactor in the schema inference. It has been confusing to work with untyped `pyarrow.RecordBatch` classes because they're a black box, and we have two distinct data schemas we're working with: one that is as close to the raw JSON as possible, and another that conforms to our STAC GeoParquet schema.

This PR refactors this internally by adding `RawBatch` and `CleanBatch` wrapper types (open to better naming suggestions, but these are not public, so we can easily rename in the future). Both hold an internal `pyarrow.RecordBatch`, but `RawBatch` aligns as closely as possible to the raw STAC JSON representation, while `CleanBatch` aligns to the STAC-GeoParquet schema. These wrapper types make it much easier to reason about the shape of the data at different points in the pipeline.
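For illustration, here is a minimal sketch of the wrapper-type pattern described above. It uses plain Python classes with an `Any`-typed `inner` field as a stand-in for `pyarrow.RecordBatch`, and a hypothetical `clean()` method marking the raw-to-clean conversion step; none of these names are taken from the actual PR.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class RawBatch:
    """Data shaped as closely as possible to the raw STAC JSON."""

    inner: Any  # pyarrow.RecordBatch in the real code

    def clean(self) -> "CleanBatch":
        # Hypothetical conversion point: the real pipeline would apply
        # its schema transformations here before wrapping the result.
        return CleanBatch(self.inner)


@dataclass(frozen=True)
class CleanBatch:
    """Data conforming to the STAC GeoParquet schema."""

    inner: Any  # pyarrow.RecordBatch in the real code
```

Because the two schemas now live in distinct types, a function signature like `def to_parquet(batch: CleanBatch)` documents exactly which shape of data it expects, and a type checker can flag places where a raw batch is passed without conversion.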
Change list
- `RawBatch` and `CleanBatch` for more reliable internal typing